In this assignment, I would like to see the correlation of diabetes and obesity with physical inactivity in the US in 2017. Thus, I download two datasets which are talking about “diagnosed diabetes among adults aged >=18 years” and “Obesity among adults aged >=18 years” in the US in 2017 from the CDC. They include estimates for the 500 largest US cities and approximately 28,000 census tracts within these cities.
I used API method to obtain my datasets from CDC. First, you have to create an account with password. Then, you have to apply for a free app token. Last, copy your API Endpoint. Both datasets contain 27 columns and 29,006 rows.
Here are my datasets links:
https://chronicdata.cdc.gov/500-Cities-Places/500-Cities-Obesity-among-adults-aged-18-years/bjvu-3y7d
https://chronicdata.cdc.gov/500-Cities-Places/500-Cities-Diagnosed-diabetes-among-adults-aged-18/cn78-b9bj
From CDC datasets, I select data_value(%), populationCount, stateabbr, statedesc(state name), city_name, geolocation.latitude, and geolocation.longitude total 7 columns.
I change my column names in order to easily understand. In diabetes and obesity datasets, I have diabetes_percentage/obesity_percentage, populationCount, state_abbr, state_name, city_name, lat, and lon.
colnames(dia_mini)[1] <- "diabetes_percentage"
colnames(dia_mini)[2] <- "populationCount"
colnames(dia_mini)[3] <- "state_abbr"
colnames(dia_mini)[4] <- "state_name"
colnames(dia_mini)[5] <- "city_name"
colnames(dia_mini)[6] <- "lat"
colnames(dia_mini)[7] <- "lon"
colnames(obe_mini)[1] <- "obesity_percentage"
colnames(obe_mini)[2] <- "populationCount"
colnames(obe_mini)[3] <- "state_abbr"
colnames(obe_mini)[4] <- "state_name"
colnames(obe_mini)[5] <- "city_name"
colnames(obe_mini)[6] <- "lat"
colnames(obe_mini)[7] <- "lon"
Merge CDC two datasets by state_abbr, populationCount, state_name, city_name, lat, and lon. Then I merge the Physical Inactivity dataset with them.
merged <-
merge(
# Data
x = dia_mini,
y = obe_mini,
# List of variables to match
by = c("state_abbr","populationCount", "state_name", "city_name", "lat", "lon"),
# keep everything!
all.x = TRUE
)
dim(merged)
## [1] 30008 8
My row number increased to 30,008 so I have to remove duplicates.
merged[, n := 1:.N, by = .(state_abbr, state_name, city_name, lat, lon)]
merged <- merged[n == 1,][, n := NULL]
dim(merged)
## [1] 28505 8
After removing duplicates, my rows shrink from 30,008 to 28,505.
In this step, I just convert character variables into numeric variables.
merged$lat <- as.numeric(merged$lat)
merged$lon <- as.numeric(merged$lon)
merged$diabetes_percentage <- as.numeric(merged$diabetes_percentage)
merged$populationCount <- as.numeric(merged$populationCount)
merged$obesity_percentage <- as.numeric(merged$obesity_percentage)
In my merged dataset, there are only 2.7% NAs values in columns of diabetes_percentage and obesity_percentage. Therefore, I’m going to remove them.
mean(is.na(merged$diabetes_percentage))
## [1] 0.02785476
mean(is.na(merged$obesity_percentage))
## [1] 0.02785476
merged <-merged[!is.na(merged$diabetes_percentage),]
merged <-merged[!is.na(merged$obesity_percentage),]
Last, I create a new column contain Northeast, Northwest, Southwest, and Southeast four different regions.
# Add regions
merged[, region := fifelse(lon >= -98 & lat > 39.71, "NE",
fifelse(lon < -98 & lat > 39.71, "NW",
fifelse(lon < -98 & lat <= 39.71, "SW","SE")))
]
table(merged$region)
##
## NE NW SE SW
## 8768 1721 8895 8326
pal_dia <- colorNumeric(c('darkblue','goldenrod','darkred'), domain=merged$diabetes_percentage)
# Diabetes percentage in the US
p1_leaflet <- leaflet() %>%
addProviderTiles('OpenStreetMap') %>%
addCircles(data = merged,
lat=~lat,lng=~lon,
label = ~paste0(round(diabetes_percentage,2)), color = ~ pal_dia(diabetes_percentage),
opacity = 0.5, fillOpacity = 1, radius = 50) %>%
# Legend
addLegend('bottomleft', pal=pal_dia, values=merged$diabetes_percentage,
title='Diabetes percentage', opacity=1)
p1_leaflet
From the Leaflet, the legend shows the degree of the diabetes
percentage. The red color means higher percentage of diabetes. I see
there are more orange dots in the NE region and SE region from the plot
of diabetes percentage.
merged$region <- factor(merged$region, levels=c("NE", "SE", "NW", "SW"))
p1_box <- merged[!is.na(diabetes_percentage)][!is.na(region)] %>%
plot_ly(x = ~region, y= ~diabetes_percentage,
type = 'box', mode = 'markers', color = ~region,
hoverinfo = 'text',
text = ~paste( paste(" State name: ", state_name, sep=""),
paste(" Region: ", region, sep=""),
paste("City name: ", city_name, sep=""),
paste(" Diabetes percentage: ", diabetes_percentage, sep=""),
sep = "<br>")) %>%
layout(title = "Diabetes percentage in different regions",
xaxis = list(title = "Regions"),
yaxis = list(title = "Diabetes percentage"),
hovermode = "compare")
p1_box
Now, let see the boxplot, the x-axis shows 4 regions: Northeast, Southeast, Northwest, and Southwest. On the y-axis shows the percentage of diabetes or obesity.
From the boxplot of diabetes percentage, there is a max diabetes percentage in the NE region, and the NE region and the SE region have a similar median diabetes percentage. The NW region has the lowest median diabetes percentage. In this plot, the east-side regions’ median diabetes percentage is higher than the west-side regions’.
merged_median <- merged[, .(
obe_median = median(obesity_percentage, na.rm=TRUE),
dia_median = median(diabetes_percentage, na.rm=TRUE)
),
by = c("state_abbr", "state_name", "region")
]
merged_median_uni <- unique(merged_median)
scatter_p <- merged_median_uni %>%
plot_ly(x = ~obe_median, y = ~dia_median,
type = 'scatter', mode = 'markers', color = ~state_abbr,
hoverinfo = 'text',
text = ~paste( paste(state_name, ":", sep=""),
paste(" State_abbr: ", state_abbr, sep=""),
paste(" Region: ", region, sep=""),
paste(" Obesity percentage: ", obe_median, sep=""),
paste(" Diabetes percentage: ", dia_median, sep=""),
sep = "<br>")) %>%
layout(title = "Obesity percentage vs. Diabetes percentage",
xaxis = list(title = "Obesity percentage"),
yaxis = list(title = "Diabetes percentage"),
hovermode = "compare")
scatter_p
In this scatter plot, I select each state’s median of obesity percentage and diabetes percentage. We can see that there is a positive correlation between obesity and diabetes rates.
From the leaflet, first we can see there are more orange dots on the NE and SE regions. From the box plot, the median of diabetes percentage looks equally high in the NE and SE regions. Besides, we can also see there are higher diabetes percentages on the east-side than on the west-side.
From the scatter plot, we can see that there is a positive correlation between obesity and diabetes rates by states.
Copyright © 2020, Sam Lu.